Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

120 ◾ Bioinformatics

Notice that we added a RG header “@GR” with “ID” and “SM” to each SAM file to pro-

vide the sample information that will be used later by the variant caller to create a genotype

column for each sample.

4.2.1.1.5 Converting SAM into BAM format

The “bwa mem” aligner created a SAM file. We can convert the SAM files into a BAM for-

mat by running the following script. Make sure that you are running the script from the

main project directory.

mkdir bam

cd sam

for i in $(ls *.sam | rev | cut -c 5- | rev);

samtools view -uS -o ../bam/${i}.bam ${i}.sam

done

cd ..

The above script will create the subdirectory “bam” and convert the SAM files into BAM

files and stores them in the “bam” subdirectory.

4.2.1.1.6 Sorting and indexing BAM files

The BAM files must be sorted and indexed before the next step of the analysis. The follow-

ing bash script creates a new subdirectory “sortedbam”, sorts the bam files, and saves the

new BAM files with sorted alignment in the new subdirectory:

mkdir sortedbam

cd bam

for i in $(ls *.bam);

samtools sort -T ../sortedbam/tmp.sort -o ../sortedbam/${i} ${i}

done

cd ..

cd sortedbam

for i in $(ls *.bam);

samtools index ${i}

done

cd ..

4.2.1.1.7 Marking duplicates

The identical reads mapped to the reference genome may bias variant calling. The dupli-

cate reads are created by the PCR amplification during sequencing. Usually, using small

amount of DNA for library preparation may lead to more duplication. Some sequencing

protocols use Unique Molecular Identifiers (UMIs) to distinguish between the artifact and

natural duplicates. Marking or removing duplicate aligned reads is a common best practice